Retrieval of Spelling Variants in Nonstandard Texts – Automated Support and Visualization

Authors

  • Thomas Pilz
  • Wolfram Luther
  • Ulrich Ammon
Abstract

This article describes ongoing research in the RSNSR (Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung, “Rule-based search in text databases with nonstandard orthography”) project. The project focuses on making historical text documents digitally available; consequently, it examines the challenges for digitization procedures and subsequent retrieval operations, such as fuzzy full-text search. Difficulties are posed by scans of low-quality facsimiles, old font types, inconsistent transcriptions, and, especially, typical optical character recognition (OCR) errors and spelling variation. This article discusses recent solutions to such problems, concentrating on stochastic string edit distance measures, so-called evidences, and the avoidance of static dictionaries. By presenting visualization approaches for retrieval in and browsing of historical databases and nonstandard text documents, as well as a prototype for visual evaluation of distance measures, it proposes a progression of information visualization in linguistics.

Similar resources

Visualizing Search Results in the Information Retrieval Process

Purpose: One of the most effective ways to achieve optimum information retrieval is through visualization of information. Search strategies, probing skills, querying of information needs and analysis of information play a significant role in accessing necessary and useful information. Besides the factors mentioned above, information visualization can increase the availability level of in...

Comparison of distance measures for historical spelling variants

This paper describes the comparison of selected distance measures in their applicability for supporting retrieval of historical spelling variants (hsv). The interdisciplinary project Rule-based search in text databases with nonstandard orthography develops a fuzzy fulltext search engine for historical text documents. This engine should provide easier text access for experts as well as intereste...

Generating Search Term Variants for Text Collections with Historic Spellings

In this paper, we describe a new approach for retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in ancient orthography. By applying a spell checker on a corpus of historic texts, we generate a list of candidate terms for which the contemporary spellings have ...

Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities

Historical corpora are important resources for different areas. Philology, Human Language Technology, Literary Studies, History, and Lexicography are some that benefit from them. However, compiling historical corpora is different from compiling contemporary corpora. Corpus designers have to deal with several characteristics inherent in historical texts, such as: absence of a spelling standard, ...

The Identification of Spelling Variants in English and German Historical Texts: Manual or Automatic?

Dawn Archer (University of Central Lancashire); Andrea Ernst-Gerlach, Sebastian Kempken, Thomas Pilz (Universität Duisburg-Essen); Paul Rayson (Lancaster University)

Publication date: 2009